Abstract


High-content screening uses large collections of unlabeled cell image data to reason
about genetics or cell biology. Two important tasks are to identify those cells
which bear interesting phenotypes, and to identify sub-populations enriched for 
these phenotypes. This exploratory data analysis usually involves dimensionality 
reduction followed by clustering, in the hope that clusters represent a phenotype. 
We propose the use of stacked de-noising auto-encoders to perform dimensionality 
reduction for high-content screening. We demonstrate the superior performance 
of our approach over PCA, Locally Linear Embedding, kernel PCA and Isomap.


1 Introduction 

The use of machine learning methods to apply phenotype labels to cells has proven an effective 
way to uncover novel roles of genes in model organisms [Vizeacoumar et al., 2010] as well as in
human cells [Fuchs et al., 2010]. In an unsupervised setting, characterizing all cells split across distinct
populations presents a clustering problem over many millions of high dimensional data points. The
high dimensionality complicates clustering, so it is preferable to first transform the data into a much
lower dimensional space. How best to perform the dimensionality reduction is far from clear. Particularly
important qualities in this use case are the ability to scale to millions of data points, and the flexibility
to model non-linear relationships between covariates in the map from higher to lower dimensional
space. Standard algorithms for dimensionality reduction fail on one or both of these criteria.

• PCA is fast once the covariance matrix is formed, but is not able to model any non-linear 
interactions. 

• Kernel PCA [Ham et al., 2004] can model more flexible relationships, but is impractical
for large data sets due to the growth of the kernel matrix. 

• Locally Linear Embedding (LLE) [Roweis, 2000] and Isomap [De Silva et al., 2003] are
impractical for large data sets as they require a large matrix decomposition that scales
cubically in the number of data points.
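The scaling argument for kernel methods can be made concrete with a small back-of-the-envelope calculation. The sketch below is illustrative only: it assumes a dense n x n kernel matrix stored in 8-byte floats, and the helper name `kernel_matrix_gib` is hypothetical rather than taken from any library.

```python
def kernel_matrix_gib(n_points: int, bytes_per_entry: int = 8) -> float:
    """Memory in GiB needed to hold a dense n x n kernel (Gram) matrix."""
    return n_points * n_points * bytes_per_entry / 2**30


# Memory grows quadratically: ten times more points costs one hundred times more memory.
for n in (10_000, 100_000, 1_000_000):
    print(f"{n:>9,d} points -> {kernel_matrix_gib(n):,.1f} GiB")
```

Under these assumptions, a million cells already requires several terabytes for the kernel matrix alone, before any cubic-time decomposition is attempted.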

Neural networks structured as auto-encoders [Hinton and Salakhutdinov, 2006] can model
non-linear interactions, and scale well to large data sets. They can also be trained with unlabeled data,
which in the case of high-content screening is plentiful. We investigated how well a class of
auto-encoders [Vincent et al., 2008] could be composed to create an efficient and flexible dimensionality
reduction algorithm.
2 Stacked De-noising Autoencoders


The idea of composing simpler models in layers to form more complex ones has been successful
with a variety of basis models, stacked de-noising autoencoders (abbrv. SdA) being one example
[Hinton and Salakhutdinov, 2006, Ranzato et al., 2008, Vincent et al., 2010]. These models were
frequently employed for unsupervised pre-training: a layer-wise scheme for initializing the
parameters of a multi-layer perceptron, which is subsequently trained by minimizing an appropriate
loss function over real versus model-predicted labels of the data. Where labeled data is plentiful,
supervised training now reigns supreme, with pre-training overlooked in favour of randomly
initialized models coupled with powerful new regularization techniques
[Srivastava, 2013, Goodfellow et al., 2013]. These supervised methods are a poor fit for high-content
screening, where biological expertise and time-intensive sampling make labels scarce. Unlabeled data
remains plentiful, so pre-training is well-suited to this scenario. We chose de-noising autoencoders since
they are simple to train without incurring any sacrifice in competitive performance [Vincent et al., 2010].

2.1 De-noising Autoencoders 

The de-noising autoencoder [Vincent et al., 2008] takes input data $x \in \mathbb{R}^d$ and maps it to a hidden
representation $y \in [0, 1]^n$, $n \ll d$, through a corrupted noisy mapping
$y = f_\theta(\tilde{x}) = \mathrm{sigmoid}(W\tilde{x} + b)$, parametrized by $\theta = \{W, b\}$.
$W$ is an $n \times d$ weight matrix and $b$ is a bias vector. The input $x$
is corrupted into $\tilde{x}$ by randomly setting a certain proportion of the values of $x$ to zero. The latent
representation $y$ is then mapped back to a reconstructed vector $z \in \mathbb{R}^d$ via
$z = g_{\theta'}(y) = W'y + b'$ with $\theta' = \{W', b'\}$. The parameters $\theta, \theta'$ are set to
minimize the reconstruction loss $L(x, z)$. See the schematic in figure 1.
Figure 1: Schematic representation of a de-noising autoencoder (abbrv. dA). The Stacked de-noising
autoencoder model is a product of layering dA models such that the hidden layer of one model is treated as the 
input layer of the next dA. The hidden layer of the top-most dA represents the reproduced data. 
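The forward pass of a single dA layer, as described in section 2.1, can be sketched in NumPy as follows. This is a minimal illustration and not the authors' Theano code. Two choices are assumptions: the text leaves the loss L(x, z) unspecified, so squared error is used here, and corruption zeroes each entry independently with probability `noise_rate` rather than zeroing an exact proportion. All function and variable names are hypothetical.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def corrupt(x, noise_rate, rng):
    """Zero each entry of x independently with probability noise_rate (assumed scheme)."""
    mask = rng.random(x.shape) >= noise_rate
    return x * mask


def da_forward(x, W, b, W_prime, b_prime, noise_rate, rng):
    """Corrupt, encode, decode, and score one mini-batch (rows of x are samples)."""
    x_tilde = corrupt(x, noise_rate, rng)
    y = sigmoid(x_tilde @ W.T + b)        # hidden code, each entry in [0, 1]
    z = y @ W_prime.T + b_prime           # linear reconstruction in R^d
    # Assumed loss: mean squared error against the clean (uncorrupted) input.
    loss = np.mean(np.sum((x - z) ** 2, axis=1))
    return y, z, loss


rng = np.random.default_rng(0)
d, n, batch = 20, 5, 8                    # toy sizes with n much smaller than d
x = rng.normal(size=(batch, d))
W = rng.normal(scale=0.1, size=(n, d))    # n x d encoder weights
b = np.zeros(n)
W_prime = rng.normal(scale=0.1, size=(d, n))
b_prime = np.zeros(d)

y, z, loss = da_forward(x, W, b, W_prime, b_prime, noise_rate=0.25, rng=rng)
print(y.shape, z.shape, loss)
```

Stacking then amounts to feeding the hidden code `y` of one trained layer in as the input `x` of the next.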


2.2 Training 


To find the best architecture, we performed a grid search over both the layer sizes and the number
of layers. Hinton and Salakhutdinov [Hinton and Salakhutdinov, 2006] used a four layer
auto-encoder for MNIST, but we began with 3 layer networks and tried both 4 and 5 layer networks.
Both 3 and 4 layer networks achieved better results than 5 layer networks, as measured by
reconstruction error on a validation set. For a given number of layers, each candidate model was a
point on a grid with a step size of 100:
700 ... 1000 × 500 ... 900 × 100 ... 400 × 50 ... 10. Each dA layer in every model was pre-trained
using mini-batch stochastic gradient descent for 50 epochs, in batches of 100 samples. The
minimum mean reconstruction error for each layer was recorded. After selecting the top 5 performing
models for both 3 and 4 layer networks, we performed grid searches over each of the tunable
hyper-parameters: momentum, noise rate, learning rate, and weight decay. We used Adagrad to adjust the
learning rate appropriately for each dimension [Duchi et al., 2010]. Tuning the initial value of the
base learning rate had the largest effect on performance. The other parameters showed much smaller
effects. Both the architecture and hyper-parameter searches were performed in parallel on a cluster
with GPU capable nodes. All GPU code was written in Python using Theano [Bergstra et al., 2010].
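To illustrate what adjusting the learning rate for each dimension means, here is a minimal NumPy sketch of the basic Adagrad update. It is not the authors' Theano implementation, and it omits the momentum and weight decay terms mentioned above. The function name, the toy quadratic objective, and the constants are assumptions chosen for illustration.

```python
import numpy as np


def adagrad_step(params, grad, accum, base_lr=0.5, eps=1e-8):
    """One Adagrad update: each dimension is scaled by its own gradient history."""
    accum = accum + grad ** 2
    params = params - base_lr * grad / (np.sqrt(accum) + eps)
    return params, accum


# Toy objective with badly scaled dimensions: f(w) = 0.5 * sum(scales * w**2).
scales = np.array([1.0, 100.0])


def loss_fn(w):
    return 0.5 * np.sum(scales * w ** 2)


w0 = np.array([1.0, 1.0])
w, accum = w0.copy(), np.zeros_like(w0)
for _ in range(100):
    grad = scales * w                     # exact gradient of the toy objective
    w, accum = adagrad_step(w, grad, accum)

print(loss_fn(w0), loss_fn(w))
```

Because each coordinate is divided by the root of its own accumulated squared gradients, both coordinates make progress at nearly the same rate despite the hundredfold difference in curvature, which is why only the base learning rate remains to be tuned.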
3 Experiments


We used data derived from images of yeast cell strains transformed to express a GFP fusion protein
acting as a marker for DNA damage foci. Each well in a 384 well plate contained a homozygous
population of yeast bearing some set of genetic deletions. Four field-of-view images per well were
captured, each containing several hundred cells. Post capture, the images were run through an
image processing pipeline using CellProfiler to segment the cells in each field and represent each
cell as a vector of intensity, shape and texture features [Kamentsky et al., 2011]. The cell phenotypes in
this study consisted of three classes: DNA-damage foci, non-round nuclei, or wild-type¹. A validation
set of approximately 10000 labeled cells was generated by first manually labeling images for each
phenotype, and then training an SVM in a one-vs-all manner to label the remaining cells. The
predicted labels were manually validated.


The data for the validation set was scaled to have mean 0 and variance 1. The scaled data was reduced 
using one of the candidate algorithms, followed by Gaussian mixture clustering using scikit-learn 
[Pedregosa et al., 2011]. The number of mixture components was chosen as the number of labels
in the validation set. All hyperparameters were chosen by cross-validation. For each model and 
embedding of the data, a Gaussian Mixture Model was run with randomly initialized parameters for 
10 iterations. Each of these runs was repeated 10 times. The parameters from the best performing run 
were used to initialize a GMM which was run until convergence, and the homogeneity was measured 
and recorded. This process was repeated five times for each model, resulting in five homogeneity 
measurements. We report the average homogeneity (fit by loess) as well as the standard deviation in figure 2.
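The evaluation procedure above can be sketched with current scikit-learn as follows. This is an approximation under stated assumptions, not the original experiment code: it uses the present-day `GaussianMixture` class, takes "best performing run" to mean the highest log-likelihood lower bound (the text does not name the criterion), and substitutes synthetic blobs reduced by PCA for the real screening data. The helper name `homogeneity_after_reduction` is hypothetical.

```python
import warnings

import numpy as np
from sklearn.datasets import make_blobs
from sklearn.decomposition import PCA
from sklearn.exceptions import ConvergenceWarning
from sklearn.metrics import homogeneity_score
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler


def homogeneity_after_reduction(X_reduced, labels, n_short_runs=10, seed=0):
    """Short random-init GMM runs, then refit from the best run until convergence."""
    n_components = len(np.unique(labels))   # one mixture component per label
    best = None
    with warnings.catch_warnings():
        # Ten EM iterations rarely converge; that is expected for the short runs.
        warnings.simplefilter("ignore", ConvergenceWarning)
        for run in range(n_short_runs):
            gmm = GaussianMixture(n_components=n_components, init_params="random",
                                  max_iter=10, random_state=seed + run).fit(X_reduced)
            # Assumed selection criterion: highest log-likelihood lower bound.
            if best is None or gmm.lower_bound_ > best.lower_bound_:
                best = gmm
    final = GaussianMixture(n_components=n_components,
                            weights_init=best.weights_,
                            means_init=best.means_,
                            precisions_init=best.precisions_,
                            max_iter=500, random_state=seed).fit(X_reduced)
    return homogeneity_score(labels, final.predict(X_reduced))


# Synthetic stand-in data: 3 labeled groups in 30 dimensions, reduced to 10 by PCA.
X, labels = make_blobs(n_samples=600, n_features=30, centers=3, random_state=0)
X_scaled = StandardScaler().fit_transform(X)         # mean 0, variance 1
X_reduced = PCA(n_components=10).fit_transform(X_scaled)
score = homogeneity_after_reduction(X_reduced, labels)
print(f"homogeneity: {score:.3f}")
```

Repeating the call with different seeds and averaging would give the kind of mean and standard deviation reported per model and target dimension.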
Figure 2: Homogeneity test results of 3 and 4 layer SdA models versus comparators. For each model, the data
was reduced from 916 dimensions down to 50 ... 10 dimensions. For each model and embedding of the data, a
Gaussian Mixture Model was run with randomly initialized parameters for 10 iterations. Each of these runs 
was repeated 10 times. The parameters from the best performing run were used to initialize a GMM which 
was run until convergence, and the homogeneity was measured and recorded. This process was repeated five 
times for each model, resulting in five homogeneity measurements. We report the average homogeneity (solid 
line) as well as the standard deviation (shadowed gray). 


4 Discussion 

Examination of figure 2 shows that SdA models were consistently ranked as the best models for
dimensionality reduction in preparation for phenotype based clustering. There is a loss of
homogeneity incurred by using an algorithm that cannot learn non-linear transformations, as shown by the
gap between Isomap and PCA. To understand what accounts for the gap between SdA models
and the other algorithms, we randomly sampled output from each algorithm and computed estimates
of the distribution over both the inter-label and intra-label distances between points. Shown in figure
3 is an example of the estimated distance distributions for kernel PCA and a sample three layer
SdA model. While the points sampled from kernel PCA have a smaller mean intra-label distance,
the points sampled from the SdA model have a larger mean inter-label distance. Therefore, a clustering
algorithm applied to SdA reduced data was more reliably able to find assignments that reflected the
phenotype labels.

¹ Yeast cells that displayed neither DNA-damage foci nor a non-round nucleus phenotype.
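The distance comparison described in this section can be sketched as below. The text does not state the metric or the sampling scheme, so this example assumes Euclidean distance and uniformly sampled random pairs. The function name and the synthetic two-class data are hypothetical and stand in for the reduced screening data.

```python
import numpy as np


def sampled_label_distances(X, labels, n_pairs=5000, seed=0):
    """Sample random point pairs and split their distances by label agreement."""
    rng = np.random.default_rng(seed)
    i = rng.integers(0, len(X), size=n_pairs)
    j = rng.integers(0, len(X), size=n_pairs)
    keep = i != j                          # drop pairs of a point with itself
    i, j = i[keep], j[keep]
    dist = np.linalg.norm(X[i] - X[j], axis=1)   # Euclidean distance (assumed)
    same = labels[i] == labels[j]
    return dist[same], dist[~same]         # intra-label, inter-label


# Synthetic stand-in: two classes in 10 dimensions with shifted means.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 10)),
               rng.normal(5.0, 1.0, size=(100, 10))])
labels = np.repeat([0, 1], 100)

intra, inter = sampled_label_distances(X, labels)
print(f"mean intra-label distance: {intra.mean():.2f}")
print(f"mean inter-label distance: {inter.mean():.2f}")
```

Histograms or kernel density estimates of `intra` and `inter` for each algorithm would give distance distributions of the kind compared in figure 3.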
Figure 3: Sampling from distributions over distances of data points reduced to 10 dimensions by both SdA
and kernel PCA. While the SdA models do not always produce the tightest packing of points within classes 
(see left), they do consistently assign the widest mean distances to inter-class points. 


We have introduced deep autoencoder models for dimensionality reduction of high-content screening
data. Mini-batch stochastic gradient descent allowed us to train using millions of data points, and 
the nature of the model allowed us to apply the resulting models to unseen data, circumventing the 
limitations of other comparable dimensionality reduction algorithms. We also demonstrated that 
SdA models produced output that was more easily assigned to clusters that reflected biologically 
meaningful phenotypes. 